AI Site Reliability Engineer

  • Singapore
  • Contract
  • Mon May 11 08:27:58 2026
  • 41901


What’s on Offer:

  • Industry: Consulting
  • Location: Singapore
  • 12-month contract role (with the possibility of extension)
  • Competitive Compensation

Job Summary:
We are seeking an AI Site Reliability Engineer to join our client’s core AI Platform Engineering team and embed reliability patterns directly into services as they are designed and built. You will also partner with the AI Security Engineer to ensure every system is secure by design, shifting security left into architecture, CI/CD pipelines, and operational practices. This role sits at the heart of the firm’s enterprise AI strategy, supporting mission-critical AI systems used across the organisation.
Job Description:
  • Own the enterprise AI gateway: Be the accountable owner for the LLM gateway and MCP gateway, covering architecture, SLOs (availability, latency, throughput), capacity planning, incident response, and roadmap.
  • Work with the SRE squads: Define on-call rotations, escalation paths, quality standards, and vendor performance expectations.
  • Set the reliability standard: Define and enforce SLOs, error budgets, and error-budget policies across all AI products. When budgets burn, you make the call to freeze features and fix reliability.
  • Harden the platform as we build it: Work with the core platform engineering squad to embed reliability patterns (circuit breakers, retry policies, graceful degradation, health checks, deployment safety) into services from the first sprint.
  • Architect for AI-specific failure modes: Design mitigation strategies for non-deterministic outputs, long-tail model latency, agent loops, cascading failures, and LLM provider outages.
  • Partner on secure-by-design: Work with the AI Security Engineer to embed threat modelling, zero-trust controls, prompt-injection defences, and content-safety guardrails into architecture and operations.
  • Eliminate toil: Automate incident detection, runbook execution, capacity scaling, deployment pipelines, and onboarding flows. Measure toil and reduce it over time.
  • Build operational excellence: Establish a blameless postmortem culture, an incident command structure, on-call health practices, and operational review cadences.
  • Raise the engineering bar: Act as SME on production engineering best practices, including testing strategies (chaos, canary, load, red-team), deployment safety (blue-green, progressive rollout), observability standards, and code-review discipline.
Job Requirements:
  • 3–8 years of experience in Site Reliability Engineering, Platform Engineering, or Software Engineering with strong production ownership
  • Deep expertise in SRE principles including SLOs, SLIs, error budgets, incident management, toil reduction, capacity planning, and chaos engineering
  • Proven experience managing SRE, platform operations, or production engineering squads, including augmented/vendor teams
  • Strong track record of owning critical API gateways, platform services, or high-throughput infrastructure with stringent availability targets (99.9%+ uptime)
  • Hands-on experience supporting AI/ML workloads in production environments, with an understanding of non-deterministic model behaviour, latency variance, token-budget management, agent loops, and LLM provider outages
  • Strong cloud infrastructure knowledge, preferably AWS: VPC architecture, EKS / Kubernetes, load balancing, auto-scaling, multi-AZ / multi-region architecture
  • Experience with observability platforms such as Datadog, Grafana, OpenTelemetry, PagerDuty, or equivalent
  • Experience with Infrastructure-as-Code tools such as Terraform or CDK
  • Experience building CI/CD pipelines using GitHub Actions, ArgoCD, or equivalent
  • Strong working proficiency in Python for tooling, debugging, and production issue resolution

Nice to Have:
  • Experience operating enterprise-scale LLM gateways, API gateways, or service meshes such as Kong, Envoy, or AWS API Gateway
  • Familiarity with MCP (Model Context Protocol) and MCP server fleet operations
  • Background in AI Security including prompt injection defence, content safety filtering, output grounding, and jailbreak mitigation
  • Experience with durable execution frameworks such as Temporal or Inngest
  • Exposure to highly regulated environments such as financial services, banking, or enterprise-scale technology organisations